To evaluate the impact of omitted variable bias (OVB) on different models, consider a system where oceanography drives site temperature and recruitment over time. Temperature also fluctuates from year to year within each site. Recruitment and temperature influence snail abundance, as do other, uncorrelated drivers. You have conducted a study in this system measuring 10 sites, each sampled once per year over 10 years, recording snail abundance and temperature, but with no measure of recruitment. Note that the model results would be the same had you instead sampled 10 sites in a single year with 10 plots per site, provided there were plot-level drivers of temperature that behaved like the year-to-year drivers below.
To parameterize our simulations, consider the following (these values are the defaults in the functions below):
- Oceanography increases site temperature (coefficient of 2) and decreases site recruitment (coefficient of -2), so the two are negatively correlated across sites.
- Mean temperature is 15 and mean recruitment is 10 individuals per plot.
- Temperature and recruitment each affect snail abundance with a true coefficient of 1.
- Year-to-year temperature variability within a site and residual variability in snail abundance each have an SD of 1.
Thus, the system looks like this:
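As a sketch, the causal structure described above can be drawn with DiagrammeR's `grViz` (the node labels are my own shorthand):

```r
# A sketch of the causal diagram: oceanography drives both temperature and
# recruitment; both, plus other uncorrelated drivers, drive snail abundance.
library(DiagrammeR)

system_dag <- grViz('
digraph ovb_system {
  graph [rankdir = LR]
  node  [shape = box]

  Oceanography -> "Site Temperature"
  Oceanography -> "Site Recruitment"
  "Site Temperature"  -> "Snail Abundance"
  "Site Recruitment"  -> "Snail Abundance"
  "Other Drivers"     -> "Snail Abundance"
}
')

system_dag
```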
To simulate data, let's begin by loading some libraries.
library(tidyverse)
library(lme4)
library(broom)
library(broom.mixed)
library(DiagrammeR)
library(glue)
theme_set(theme_bw(base_size = 14))
Next, we need a function that will create a template of simulated sites based on oceanography and the sampling design described above.
make_environment <- function(n_sites = 10,
                             ocean_temp = 2,
                             temp_sd = 0,
                             ocean_recruitment = -2,
                             recruitment_sd = 0,
                             temp_mean = 15,
                             rec_mean = 10,
                             seed = NULL){
  if(!is.null(seed)) set.seed(seed)

  tibble(site = as.character(1:n_sites)) %>%
    mutate(
      oceanography = rnorm(n_sites),
      site_temp = temp_mean +
        rnorm(n_sites, ocean_temp * oceanography, temp_sd),
      site_recruitment = rec_mean +
        rnorm(n_sites, ocean_recruitment * oceanography, recruitment_sd)
    )
}
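As a quick sanity check of this function (the seed here is arbitrary): with the default `temp_sd = 0` and `recruitment_sd = 0`, site temperature and recruitment are exact linear functions of oceanography with opposite signs, so their correlation across sites should be exactly -1.

```r
demo_sites <- make_environment(seed = 609)

# exact linear functions of oceanography with opposite-signed slopes
cor(demo_sites$site_temp, demo_sites$site_recruitment)
# -1
```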
Great. Now, we need to add that year-to-year or plot-to-plot variability.
make_plots <- function(sites_df,
                       n_plots_per_site = 10,
                       plot_temp_sd = 1,
                       temp_effect = 1,
                       recruitment_effect = 1,
                       sd_plot = 1,
                       seed = NULL){
  if(!is.null(seed)) set.seed(seed)

  sites_df %>%
    rowwise() %>%
    mutate(
      plot_temp_dev_actual = list(rnorm(n_plots_per_site,
                                        0, plot_temp_sd))) %>%
    unnest(plot_temp_dev_actual) %>%
    mutate(
      plot_temp = site_temp + plot_temp_dev_actual,
      snails = rnorm(n(),
                     temp_effect * plot_temp +
                       recruitment_effect * site_recruitment,
                     sd_plot)) %>%
    ungroup() %>%
    group_by(site) %>%
    mutate(year = 1:n(),
           site_mean_temp = mean(plot_temp),
           plot_temp_dev = plot_temp - site_mean_temp,
           site_mean_snail = mean(snails),
           site_snail_dev = snails - site_mean_snail,
           delta_snails = snails - lag(snails),
           delta_temp = plot_temp - lag(plot_temp)) %>%
    ungroup()
}
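And a quick check that the derived columns behave as intended (again with an arbitrary seed): within each site, the temperature deviations should average to zero, and the first-differenced columns should be NA in year 1.

```r
demo_plots <- make_environment(seed = 609) %>%
  make_plots(seed = 609)

# deviations from the site mean average to zero within each site;
# differencing leaves an NA in the first year
demo_plots %>%
  group_by(site) %>%
  summarize(mean_dev = mean(plot_temp_dev),
            year_1_delta = delta_snails[year == 1])
```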
To analyze our data, we will compare several different model fits.
analyze_plots <- function(plot_df){
  m <- tribble(
    ~model_type, ~fit,
    "Naive", lm(snails ~ plot_temp, data = plot_df),
    "RE", lmer(snails ~ plot_temp + (1|site), data = plot_df),
    "First Differences", lm(delta_snails ~ delta_temp, data = plot_df), #fix SE?
    "FE with Dummy Variables", lm(snails ~ plot_temp + site, data = plot_df),
    "FE Using Mean Differencing", lm(snails ~ plot_temp + site, data = plot_df), #fix SE?
    "Group Mean Covariate", lmer(snails ~ plot_temp + site_mean_temp + (1|site), data = plot_df),
    "Group Mean Centered", lmer(snails ~ plot_temp_dev + site_mean_temp + (1|site), data = plot_df),
    "Group Mean Covariate, no RE", lm(snails ~ plot_temp + site_mean_temp, data = plot_df),
    "Group Mean Centered, no RE", lm(snails ~ plot_temp_dev + site_mean_temp, data = plot_df)
  ) %>%
    mutate(coefs = map(fit, tidy), #get coefficients with broom
           out_stats = map(fit, glance), #get model-level stats with broom
           temp_effect = map(coefs, get_temp_coef),
           model_type = fct_inorder(model_type))

  m
}
get_temp_coef <- function(a_tidy_frame){
  a_tidy_frame %>%
    filter(term %in% c("plot_temp", "plot_temp_dev", "delta_temp")) %>%
    select(estimate, std.error)
}
Let’s begin by setting up 100 replicate simulations.
set.seed(31415)
n_sims <- 100

envt <- tibble(sims = 1:n_sims) %>%
  mutate(sites = map(sims, ~make_environment()))
Just for a sanity check, here’s the relationship between temperature and recruitment at the site level across all simulations.
ggplot(envt %>% unnest(sites),
       aes(x = site_temp, y = site_recruitment, group = sims)) +
  geom_point(alpha = 0.4, color = 'grey') +
  stat_smooth(method = "lm", size = 0.5, alpha = 0.5,
              fill = NA, color = "black") +
  labs(x = "Site Average Temperature",
       y = "Site Recruitment of Individuals Per Plot")
Great! Now, let's set up our sampling over time.
plots_df <- envt %>%
  mutate(site_year = map(sites, make_plots))
And, again, a sanity check…
ggplot(plots_df %>% unnest(site_year),
       aes(x = plot_temp, y = snails, color = site)) +
  geom_point(alpha = 0.5) +
  labs(x = "Plot Temperature",
       y = "Snails per Plot")
So we can see that in this setup, the snail-temperature relationship is positive. But, how positive is it? What would our coefficients show?
Let's fit models to each set of data:
analysis_df <- plots_df %>%
  mutate(analysis = map(site_year, analyze_plots)) %>%
  unnest(analysis)
And now let's look at the distribution of coefficients describing the relationship between temperature and snails from each model.
analysis_df %>%
  unnest(temp_effect) %>%
  ggplot(aes(y = fct_rev(model_type), x = estimate)) +
  ggridges::stat_density_ridges() +
  labs(y = "", x = "Model Estimated Temperature Effect")
Eyeballing it, we can see that the naive model's estimate is, of course, far too low. How bad is the bias for the RE model?
analysis_df %>%
  unnest(temp_effect) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `SD Estimate` = sd(estimate)) %>%
  knit_table
| Model Type | Mean Estimate | SD Estimate |
|---|---|---|
| Naive | 0.232 | 0.096 |
| RE | 0.869 | 0.159 |
| First Differences | 0.989 | 0.130 |
| FE with Dummy Variables | 0.996 | 0.109 |
| FE Using Mean Differencing | 0.996 | 0.109 |
| Group Mean Covariate | 0.996 | 0.109 |
| Group Mean Centered | 0.996 | 0.109 |
| Group Mean Covariate, no RE | 0.996 | 0.109 |
| Group Mean Centered, no RE | 0.996 | 0.109 |
The downward bias produced by poor model choice is clear. We can see it if we plot 1 minus the estimated coefficient (the true value of the temperature effect is 1).
analysis_df %>%
  unnest(temp_effect) %>%
  ggplot(aes(y = fct_rev(model_type), x = 1 - estimate)) +
  ggridges::stat_density_ridges() +
  labs(y = "", x = "Bias")
We can also look at the distribution, for each simulated data set, of how different the RE model's estimate is from each other model's.
analysis_df %>%
  filter(model_type != "Naive") %>%
  unnest(temp_effect) %>%
  group_by(sims) %>%
  mutate(diff_from_re = estimate - estimate[1]) %>%
  ungroup() %>%
  filter(model_type != "RE") %>%
  ggplot(aes(y = fct_rev(model_type), x = diff_from_re)) +
  ggridges::stat_density_ridges() +
  labs(y = "", x = "Difference from RE Model\nTemperature Effect")
In all cases we can see the effect of the RE model's downward bias: every other estimator sits above it. This is clear from the plot, but to put it in numbers -
analysis_df %>%
  filter(model_type != "Naive") %>%
  unnest(temp_effect) %>%
  group_by(sims) %>%
  mutate(diff_from_re = estimate - estimate[1]) %>%
  ungroup() %>%
  filter(model_type != "RE") %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Diff from RE` = mean(diff_from_re),
            `SD Diff from RE` = sd(diff_from_re)) %>%
  knit_table
| Model Type | Mean Diff from RE | SD Diff from RE |
|---|---|---|
| First Differences | 0.120 | 0.112 |
| FE with Dummy Variables | 0.128 | 0.097 |
| FE Using Mean Differencing | 0.128 | 0.097 |
| Group Mean Covariate | 0.128 | 0.097 |
| Group Mean Centered | 0.128 | 0.097 |
| Group Mean Covariate, no RE | 0.128 | 0.097 |
| Group Mean Centered, no RE | 0.128 | 0.097 |
If we were doing straight hypothesis testing, how often would the 95% confidence interval of our temperature coefficient either overlap 0 or fail to contain the true value of 1?
analysis_df %>%
  filter(model_type != "Naive") %>%
  unnest(temp_effect) %>%
  mutate(overlap_0 = (estimate - 2*std.error) < 0,
         overlap_1 = (estimate - 2*std.error) < 1 &
           (estimate + 2*std.error) > 1) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`95% CI Contains 0` = sum(overlap_0)/n(),
            `95% CI does Not Contain 1` = (n() - sum(overlap_1))/n()) %>%
  knit_table
| Model Type | 95% CI Contains 0 | 95% CI does Not Contain 1 |
|---|---|---|
| RE | 0.01 | 0.28 |
| First Differences | 0.00 | 0.12 |
| FE with Dummy Variables | 0.00 | 0.05 |
| FE Using Mean Differencing | 0.00 | 0.05 |
| Group Mean Covariate | 0.00 | 0.05 |
| Group Mean Centered | 0.00 | 0.05 |
| Group Mean Covariate, no RE | 0.00 | 0.03 |
| Group Mean Centered, no RE | 0.00 | 0.03 |
Here we see that the RE model, while rarely, can be subject to type II error. Further, it is far more likely than any other technique to fail to capture the true coefficient value within two standard errors of its estimate.
This has been useful, but, if we want to automate the process for further exploration, let’s wrap the code above into a function.
make_sims_and_analyze <- function(n_sims = 100,
                                  n_sites = 10,
                                  ocean_temp = 2,
                                  temp_sd = 0,
                                  ocean_recruitment = -2,
                                  recruitment_sd = 0,
                                  temp_mean = 15,
                                  rec_mean = 10,
                                  n_plots_per_site = 10,
                                  plot_temp_sd = 1,
                                  temp_effect = 1,
                                  recruitment_effect = 1,
                                  sd_plot = 1,
                                  seed = NULL) {
  #should we set a seed?
  if (!is.null(seed)) set.seed(seed)

  # make an envt data frame
  tibble(sims = 1:n_sims) %>%
    mutate(
      sites = map(
        sims,
        ~make_environment(
          n_sites = n_sites,
          ocean_temp = ocean_temp,
          temp_sd = temp_sd,
          ocean_recruitment = ocean_recruitment,
          recruitment_sd = recruitment_sd,
          temp_mean = temp_mean,
          rec_mean = rec_mean)
      )
    ) %>%
    #now add plots
    mutate(
      site_year = map(
        sites,
        make_plots,
        n_plots_per_site = n_plots_per_site,
        plot_temp_sd = plot_temp_sd,
        temp_effect = temp_effect,
        recruitment_effect = recruitment_effect,
        sd_plot = sd_plot
      )
    ) %>%
    #and analysis
    mutate(analysis = map(site_year, analyze_plots)) %>%
    unnest(analysis)
}
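Before using it, a quick smoke test (tiny run, arbitrary seed): the wrapper should return one row per simulation per model - here, 2 simulations x 9 models.

```r
smoke <- make_sims_and_analyze(n_sims = 2, seed = 609)

nrow(smoke) # 18: 2 sims x 9 models
levels(smoke$model_type)
```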
If we look at the Group Mean Covariate and Group Mean Centered models, what is the estimated site random effect?
analysis_df %>%
  filter(model_type %in% c("Group Mean Covariate", "Group Mean Centered")) %>%
  unnest(coefs) %>%
  filter(group == "site") %>%
  group_by(`Model Type` = model_type) %>%
  summarize(Term = "Site Random Effect",
            `Mean Site SD` = mean(estimate),
            `SD in Site SD` = sd(estimate)) %>%
  knit_table
| Model Type | Term | Mean Site SD | SD in Site SD |
|---|---|---|---|
| Group Mean Covariate | Site Random Effect | 0.257 | 0.179 |
| Group Mean Centered | Site Random Effect | 0.257 | 0.179 |
Both are the same, and neither is distinguishable from 0 (indeed, you'd see some warnings above if you changed the warning option in this markdown).
Why is that?
In the system we have set up, there is no site-level variability uncorrelated with oceanography. It's all at the site-year (or site-plot) level. We have indeed run these models with no site random effect, and they produce the same answers.
So should we just not worry about a site-level random effect? No, because in the simulations above we have not allowed for variation at the site level that is independent of oceanography. We can change this. In the simulation code above, we can add variation in recruitment or site mean temperature due to forces other than oceanography by tweaking their SD arguments. If we do this with site temperature, it should merely aid in the estimation of our temperature coefficient, as it adds temperature variability that is uncorrelated with recruitment.
Instead, we can add non-oceanography-driven variation via recruitment. Let's assume there is additional variation in average site recruitment that is uncorrelated with oceanography. This is one way to create site-level variability that is not correlated with oceanography. Now, using our new function that ties the room together, let's create some new analyses where there is an additional SD of 1 in recruitment and look at the RE.
rec_var_df <- make_sims_and_analyze(recruitment_sd = 1)
You can now see that the temp-recruitment relationship has some wiggle.
ggplot(rec_var_df %>%
         group_by(sims) %>%
         slice(1L) %>%
         ungroup() %>%
         unnest(sites),
       aes(x = site_temp, y = site_recruitment, group = sims)) +
  geom_point(alpha = 0.4, color = 'grey') +
  stat_smooth(method = "lm", size = 0.5, alpha = 0.5,
              fill = NA, color = "black") +
  labs(x = "Site Average Temperature",
       y = "Site Recruitment of Individuals Per Plot")
The models that control for omitted variable bias still outperform those that do not.
rec_var_df %>%
  unnest(temp_effect) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `SD Estimate` = sd(estimate)) %>%
  knit_table
| Model Type | Mean Estimate | SD Estimate |
|---|---|---|
| Naive | 0.234 | 0.183 |
| RE | 0.896 | 0.113 |
| First Differences | 0.990 | 0.129 |
| FE with Dummy Variables | 0.984 | 0.102 |
| FE Using Mean Differencing | 0.984 | 0.102 |
| Group Mean Covariate | 0.984 | 0.102 |
| Group Mean Centered | 0.984 | 0.102 |
| Group Mean Covariate, no RE | 0.984 | 0.102 |
| Group Mean Centered, no RE | 0.984 | 0.102 |
But the difference is not as pronounced due to the weakening of the correlation between temperature and recruitment. However, the site RE is now clearly not 0.
rec_var_df %>%
  filter(model_type %in% c("Group Mean Covariate", "Group Mean Centered")) %>%
  unnest(coefs) %>%
  filter(group == "site") %>%
  group_by(`Model Type` = model_type) %>%
  summarize(Term = "Site Random Effect",
            `Mean Site SD` = mean(estimate),
            `SD in Site SD` = sd(estimate)) %>%
  knit_table
| Model Type | Term | Mean Site SD | SD in Site SD |
|---|---|---|---|
| Group Mean Covariate | Site Random Effect | 0.999 | 0.334 |
| Group Mean Centered | Site Random Effect | 0.999 | 0.334 |
This means that we can now look at residual variation due to between-site differences as well as residual replicate-level variation. Further, while models without a site random effect can be fit and used, they are structurally incorrect - replicates are not IID, as there is correlated error within sites.
Practically, the difference comes down to whether or not you are interested in between-site variation. With no RE, the residual SD inflates, as it is now the combination of the between-site and within-site residual terms. This could make a difference for various statistical tests down the line, but in terms of parameter estimates, we are still estimating a clean causal effect.
rec_var_df %>%
  filter(grepl("Group Mean", model_type)) %>%
  unnest(out_stats) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Residual SD` = mean(sigma),
            `SD in Residual SD` = sd(sigma)) %>%
  ungroup() %>%
  knit_table
| Model Type | Mean Residual SD | SD in Residual SD |
|---|---|---|
| Group Mean Covariate | 1.007 | 0.065 |
| Group Mean Centered | 1.007 | 0.065 |
| Group Mean Covariate, no RE | 1.376 | 0.199 |
| Group Mean Centered, no RE | 1.376 | 0.199 |
If you do use models without random effects, apply cluster-robust standard errors!
library(fixest)

fixest_df <- rec_var_df %>%
  group_by(sims) %>%
  slice(1L) %>%
  ungroup() %>%
  select(sims, sites, site_year) %>%
  mutate(fixest_fit = map(site_year,
                          ~feols(snails ~ plot_temp | site,
                                 cluster = "site",
                                 data = .x)),
         first_diff_fit = map(site_year,
                              ~feols(delta_snails ~ delta_temp,
                                     cluster = "site",
                                     data = .x))
  ) %>%
  pivot_longer(cols = c(fixest_fit, first_diff_fit),
               names_to = "model_type",
               values_to = "fit") %>%
  mutate(coefs = map(fit, tidy))
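What do those cluster-robust fits give us? A quick summary sketch (my code, mirroring the coefficient tables above):

```r
# summarize the fixest temperature coefficients and their
# cluster-robust standard errors across simulations
fixest_df %>%
  unnest(coefs) %>%
  filter(term %in% c("plot_temp", "delta_temp")) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `Mean Cluster-Robust SE` = mean(std.error)) %>%
  knit_table
```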
One of the advantages to mixed modeling approaches is how they handle unbalanced data. What if, in the above example, we had lost samples from each site generating unbalanced data?
set.seed(31415)

#what do we keep?
years_to_keep <- round(runif(10, 3, 10)) %>%
  imap_dfr(~ tibble(site = as.character(.y),
                    year = sample(1:10, .x),
                    keep = TRUE))

#a function for unbalancing
unbalance <- function(site_year_df, keep_df){
  left_join(site_year_df, keep_df) %>%
    filter(keep)
}
#apply keeping to each site-year
unbalanced_df <- rec_var_df %>%
  select(-model_type, -fit, -coefs, -out_stats, -temp_effect) %>%
  distinct(sims, .keep_all = TRUE) %>% #keep one copy of each simulated data set
  mutate(site_year = map(site_year, unbalance,
                         keep_df = years_to_keep),
         analysis = map(site_year, analyze_plots)) %>%
  unnest(analysis)
How does this lack of balance affect estimation of causal coefficients?
unbalanced_df %>%
  unnest(temp_effect) %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Estimate` = mean(estimate),
            `SD Estimate` = sd(estimate)) %>%
  knit_table
| Model Type | Mean Estimate | SD Estimate |
|---|---|---|
| Naive | 0.237 | 0.205 |
| RE | 0.834 | 0.148 |
| First Differences | 0.989 | 0.148 |
| FE with Dummy Variables | 0.984 | 0.124 |
| FE Using Mean Differencing | 0.984 | 0.124 |
| Group Mean Covariate | 0.982 | 0.125 |
| Group Mean Centered | 0.982 | 0.125 |
| Group Mean Covariate, no RE | 0.977 | 0.139 |
| Group Mean Centered, no RE | 0.977 | 0.139 |
We can see that point estimation holds up reasonably well here, although this could vary. More telling would be an exploration of those group means in the mixed versus fixed models.
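One way to sketch that exploration (my code, not the original analysis) is to compare the `site_mean_temp` coefficient across the group-mean models fit to the unbalanced data. Under the Mundlak logic of these models, the group mean should absorb the omitted recruitment signal, so its estimate should differ sharply from the within-site temperature effect.

```r
# compare estimates of the group-mean (between-site) temperature term
# across the four Group Mean model variants
unbalanced_df %>%
  filter(grepl("Group Mean", model_type)) %>%
  unnest(coefs) %>%
  filter(term == "site_mean_temp") %>%
  group_by(`Model Type` = model_type) %>%
  summarize(`Mean Group Mean Estimate` = mean(estimate),
            `SD of Group Mean Estimate` = sd(estimate)) %>%
  knit_table
```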